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ABSTRACT 

Motivation: Microarray experiments result in large 
scale data sets that require extensive mining and re- 
fining to extract useful information. We demonstrate 
the usefulness of (nonmetric) multidimensional scaling 
(MDS) method in analyzing a large number of genes. 
Applying MDS to the microarray data is certainly not 
new, but the existing works are all on small numbers (< 
100) of points to be analyzed. We have been develop- 
ing an efficient novel algorithm for nonmetric multidi- 
mensional scaling (nMDS) analysis for very large data 
sets as a maximally unsupervised data mining device. 
We wish to demonstrate its usefulness in the context 
of bioinformatics (unraveling relational patterns among 
genes from time series data in this paper). 
Results: The Pearson correlation coefficient with its 
sign nipped is used to measure the dissimilarity of the 
gene activities in transcriptional response of cell-cycle- 
synchronized human fibroblasts to serum [Iyer et al., 
Science 283, 83 (1999)]. These dissimilarity data have 
been analyzed with our nMDS algorithm to produce an 
almost circular relational pattern of the genes. The ob- 
tained pattern expresses a temporal order in the data 
in this example; the temporal expression pattern of the 
genes rotates along this circular arrangement and is re- 
lated to the cell cycle. For the data we analyze in this 
paper we observe the following. If an appropriate prepa- 
ration procedure is applied to the original data set, lin- 
ear methods such as the principal component analysis 
(PCA) could achieve reasonable results, but without 
data preprocessing linear methods such as PCA cannot 
achieve a useful picture. Furthermore, even with an 
appropriate data preprocessing, the outcomes of linear 
procedures are not as clearcut as those by nMDS with- 
out preprocessing. 

Availability: The fortran source code of the method 

used in this analysis ('pure nMDS') is available at 

http://www.granular.com/MDS/ 

Contact: tagygranular.com, yoono@uiuc.edu 



INTRODUCTION 

Each DNA microarray experiment can give us informa- 
tion about the relative populations of mRNAs for thou- 
sands of genes. This implies that without extensive data 
mining it is often hard to recognize any useful informa- 
tion from the experimental results. In this paper we 
demonstrate that a nonmetric multidimensional scaling 
(nMDS) method can be a powerful unsupervised means 
to extract relational patterns in gene expression. A data 
mining procedure may be useful, if it is flexible enough 
to incorporate any level of supervision, but we believe 
that the most basic feature required for any good data 
mining method is to be able to extract recognizable sig- 
nificant patterns reproducibly without supervision. In 



this sense our nMDS method is clearly demonstrated to 
be a useful means of data mining. 

Since MDS is a very natural tool to visualize the 
salient features in the data in general, initially we ex- 
pected many existing published works but to our knowl- 
edge actually there are not so many. Besides, these 
existing examples (only a few published papers in the 
bioinformatics area, e.g., Dyrskjot et al. (2002)), are 
mostly metric MDS examples and the number of points 
to be imbedded is small (about 30). Shmulevich and 
Zhang (2002) applies a nonmetric MDS method, but 
again they classify tumor types. That is, they study 
less than 100 points to embed (visualize). 

We have been developing an efficient nMDS tech- 
nique for large data sets (Taguchi and Oono, 1999, 
Taguchi et al, 2001, Taguchi and Oono, 2004). The 
input is the rank order of (dis)similarities among the ob- 
jects (in the present case, genes). Our algorithm is max- 
imally nonmetric in the sense that any introduction of 
intermediate metric coordinates obtained by monotone 
regression common to the conventional nMDS methods 
is avoided. 

Data compression is essentially a problem of lin- 
ear functional analysis as Donoho et al. (1998) stresses. 
In contrast, we believe that nonlinear methods should 
be important to data-mining. There are linear alge- 
braic methods such as the principal component analysis 
(PCA) for data mining, but it is expected that nonlin- 
ear methods are, in principle, more powerful than linear 
methods. The present paper illustrates this point. In- 
deed, in our case PCA cannot find any comprehensible 
pattern in low dimensional spaces without an appropri- 
ate data polishing. 

SYSTEMS AND METHODS 
Systems to Analyze 

The gene activities in transcriptional response of cell 
cycle-synchronized human fibroblasts to serum reported 

by Iyer et al. (1999) are analyzed. The microarray data 

used in this analysis is available at http://genome-www.stanford.edu/serv, 



Other Possible Analysis Methods 

To extract interpretable patterns from microarray data, 
cluster and linear multivariate analyses seem to be the 
two major strategies. However, these methods may not 
always work as shown here in the example of the anal- 
ysis of time series data. 

The cluster analysis seems to be the most common 



2 



Fig. 1 



approach (Slonim, 2002). For example, the hierarchi- 
cal clustering method seems to be popular (Eisen et 
al., 1998). Classifying the expression patterns as func- 
tions of time is often attempted by clustering methods 
(Spellman et al, 1998). However, it is difficult to see 
continuous relations among genes by cluster analysis, 
because it must classify genes into a limited number of 
categories. For example, it is difficult to see how genes 
are expressed temporally. In contrast to clustering anal- 
yses, nMDS can place genes in a continuous space, so 
we can easily recognize the relationship among genes 
that change in a continuous manner. Thus, the nMDS 
method can capture features which cluster analysis is 
hard to find. 

The principal component analysis (PCA) is a method 
that visualizes the relationship among genes in a con- 
tinuous vector space. The main idea is to choose a 
data- adapted basis set, and to make a subspacc that 
can capture salient features of the original data set. In 
principle, the method could capture the continuous re- 
lations in the gene expression pattern, but the dimen- 
sion of the subspace may not be low even if the data 
are on a very low dimensional manifold. In short, the 
information compression capability of linear methods 
is generally weak. This can be illustrated well by the 
data we wish to analyze in this paper: PCA does not 
capture the temporal order as clearly as nMDS. Figure 
1 exhibits the two dimensional space spanned by the 
first two principal components. The temporal expres- 
sion pattern is hardly recognized in the result. This 
should be compared with the nMDS results in Fig. 4. 

Apparently, however, some linear analysis methods 
work well. Holter et al. (2000) demonstrated that the 
singular value decomposition (SVD; a linear method) 
is remarkably successful in extracting the characteristic 
modes. The reader must wonder why there is a differ- 
ence between this result and the one due to PCA that 
is not successful. The secret is in the highly nonlinear 
'polishing' of the original data (proposed by Eisen et al. 
(1998)). However, the role of this nonlinear polishing 
must be considered carefully, because it can generate 
a spurious temporal behavior. Therefore, we relegate 
the comparison of these linear methods with data pre- 
processing and our nMDS to Appendix II. The salient 
conclusions are: 

(1) Linear methods such as PCA and SVD could per- 
haps achieve reasonable results, if a data preparation 
scheme is chosen appropriately. However, even the best 
linear results are generally fairly inferior to nonlinear 
results. 

(2) The data preparation such as the 'polishing' used 
by Holter et al. (2000) could actually corrupt the origi- 



nal data (as illustrated in Appendix II), and should be 
avoided. 



An interesting proposal to extract a temporal or- 
der is to use the partial least squares (PLS) regression 
(Johansson et al, 2003). In this case one may assume 
a temporal order one wishes to extract (say, a sinu- 
soidal change in time), and the original data are or- 
ganized around the expected pattern. This is, so to 
speak, to analyze a set of data according to a certain 
prejudice. Although during the process of organization 
no supervision is needed, the pattern to be extracted 
(that is, the 'prejudice') must be presupposed. Further- 
more, if a clear objective pattern could be extractable 
by this method, certainly nMDS can achieve the same 
goal without any presupposed pattern required by PLS. 

Metric multidimensional scaling methods (MDS) 
may also be used, but they depend on the definition 
of the dissimilarity. Therefore, unless the measure of 
the dissimilarity is (almost) dictated by the data or 
by the context of the data analysis, arbitrary elements 
are introduced. Even if a dissimilarity measure is al- 
ready given, if the measure is not positive definite (as 
in the case of the microarray data), the metric MDS re- 
quires its conversion to a positive definite metric. Thus, 
an arbitrary factor may further be introduced, and the 
metric supposedly capturing detailed information could 
carry spurious information (disinformation, so to speak) 
as well. Indeed, this extra factor can completely change 
the possible geometrical features in the data as can be 
seen from the following simple example. Take three 
samples A, B and C with the dissimilarities given as 
AB = -1, BC = 0, AC = 1. To convert them into 
positive numbers we shift the origin. If we add 2, A, 
B, C can be arranged consistently in ID. If we add 
1.5, the triangle inequality is violated. If we add 3, it 
can be geometrically realized but not in ID. nMDS is 
completely free from this problem, because adding a 
common number cannot change the ordering of the dis- 
similarities. Another disadvantage of metric MDS (and 
of linear multivariate methods) is that these methods 
are vulnerable to missing or grossly inaccurate data. 

The application of nonmctric MDS is a very nat- 
ural idea, but there are only a small number of pub- 
lished examples. Furthermore, to our knowledge there 
is no paper analyzing large number of objects by nMDS. 
Dyrskjot et al. (2002) is a rare example of applying 
nMDS, but, in contrast to ours, they never used MDS 
to analyze genes; they analyzed 40 tumors with a pack- 
age in SPSS. Our main purpose is to extract information 
on genes themselves. 
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The cluster analysis with the aid of self-organizing 
maps (SOM) is definitely a nonlinear data analysis meth< 
but as we have seen in Kasturi et al. (2003) it is not gen- 
erally suitable for extracting temporal order. Kasturi et 
al. comment that SOM is not particularly better than 
the ordinary cluster analysis. Besides, as can be seen 
from the fact that the use of a particular initial con- 
dition can be a methodological paper (Kanaya et al, 
2001), we must worry about the ad hoc initial-condition 
dependence of the results. 



ALGORITHM 

Basic idea of algorithm for nMDS 

The philosophy of nMDS (Shcpard 1962a, b, Kruskal 
1964a, b) is to find a constellation in a certain metric 
space 1Z of points representing the objects under study 
(genes in the present case) such that the pairwise dis- 
tances d of the points in 1Z have the rank order in closest 
agreement with the rank order of the pairwise dissimi- 
larities S of the corresponding objects that are given as 
the raw (or the original) input data. 

The conventional nMDS methods assume a certain 
intermediate pair distance d that is chosen as close as 
possible to d for a given object pair under the condition 
that it is monotone with respect to the actually given 
ordering of the dissimilarities S. The choice of d is not 
unique. The discrepancy between d and d is called the 
stress, and all the algorithms attempt to minimize it. 
Depending on the choice of d and on the interpretation 
of "as close as," different methods have been proposed 
(see, for example, Green et al. (1970), Cox and Cox 
(1994), and Borg and Groenen (1997) ). The choice of 
d affects the outcome, d is required only by technical 
reasons for implementation of the basic nonmetric idea, 
so to be faithful to the original idea due to Shepard 
(1962a, b) we must compare S with d directly. Our 
motivation is to make an algorithm that is maximally 
nonmetric in the sense that we get rid of d. 

The basic idea of this 'purely nonmetric' algorithm 
is as follows (Taguchi and Oono, 1999, Taguchi et al, 
2001, Taguchi and Oono, 2004): in a metric space 1Z (in 
this paper, D-dimcnsional Euclidean space R is used) 
iV points representing the N objects are placed as an 
initial configuration. For this initial trial configuration 
we compute the pair distances d(i,j), and then rank 
them according to their magnitudes. Comparing this 
ranking and that according to the dissimilarity data 
5(i,j), we compute an appropriate 'force' that moves 
the points in 1Z to reduce the discrepancy between these 
two rankings. After moving the points according to the 



'forces', the new 'forces' are computed again, and the 
Iwhole adjusting process of the object positions in 1Z is 
iterated until they converge sufficiently. The details are 
in Appendix I. 

nMDS can usually recover geometrical objects cor- 
rectly (up to scaling, orientation, and direction) when 
there are sufficiently many (say, > 30) objects. Further- 
more, even with metric data converting them into the 
corresponding rank order data may be a general strat- 
egy to extract robust features in the metric data. There- 
fore, nMDS is a versatile multivariate analysis method. 

It is desirable to have a criterion for convergence 
(analogous to the level of the stress in the conventional 
nMDS), or a measure of goodness of embedding. To 
this end let us recall the Kendall statistics K (p364, 
Hollander and Wolfe (1999)), 

K = ^2 si S n [( rf a - d p )(S a - 6p)], 

where the summation is over all the pairs of dissimilar- 
ities (distances between objects) {a, (3) (i.e., a (also (3) 
denotes a pair of objects) . Usually, this is used for a sta- 
tistical test to reject the null hypothesis that {d(i,j)} 
does not correlate with {5(i,j)} (The contribution of 
ties is usually negligible for large data set, so we do not 
pay any particular attention to tie data). 

Here, we use this value to estimate the number 
of the objects embedded correctly. If N' objects are 
correctly embedded, and if we may assume that the 
rest are uncorrelated, then K is expected to be 

n'(n' - l)/2 + 0[N 2 } > K > n'(n' - l)/2 - 0[N 2 }, 

where n' = N'(N' — 1)/2. This can be shown as follows. 
Since N' is correctly embedded, n' pairs must be cor- 
rectly ordered, so the contribution from these correct 
points to K is n' . We assume the rest is random. Their 
contribution is the random sum bounded by 0) 

that is of order ^/ Jv c 2 C l 2 — 0[N 2 ], if we assume that the 
central limit theorem applies. This gives the latitude in 
the above inequalities. If the embedding is successful 
for the majority of the objects, then n' — 0[N 2 ], so we 
may ignore the contribution of the bad points. Thus, 
we may estimate 



N' ~ \ 2V2K. 



Therefore, we adopt 100V / 2\72T/7V% as an indicator of 
the goodness of embedding. Although one might criti- 
cize that this is a crude measure of goodness, notice that 
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MDS does not have any clear criterion of the goodness 
of embedding except for the so-called stress. In contrast 
to the stress, N' gives an effective number of correctly 
embedded points. 

The plausibility of the above estimate of the num- 
ber of correctly embedded points may be illustrated as 
follows: We prepare a data set of 200 objects whose 
subset A consisting of M objects is embeddable in a 
3D space, but whose remaining 200 — M objects are 
not. The subset A consists of M randomly positioned 
points in 3D; their x, y, z coordinates are uniform ran- 
dom numbers in [—1,1]. The dissimilarity between ob- 
jects i,j S A is defined by 



$ij = Yg[(a* - Xj) 2 + {yi - Vj) 2 + {Zi - Zj) 2 ]. 

The dissimilarities between the points both of which 
are not in A are chosen to be the square root of ran- 
dom numbers uniformly distributed in [0, 2). With these 
choices, the variance of the dissimilarities are unity for 
both cases. We embedded these 200 objects into a 3D 

space using these dissimilarities for several values of M. 

Table. l| 

As can be seen from the results in Table 1, N' 
monotonically increases with M. The difference be- 
tween them decreases as TV — M decreases. Actually, 
N' - M oc - M. Thus, we can expect that N'/N 
is asymptotically an effective fraction of correctly em- 
bedded points in the N -> oc limit with (N - M) /N 
kept constant. 

We must also discuss the initial configuration de- 
pendence of the result. Our algorithm is not free from 
the problem of local minima as all of the previously 
proposed algorithms for nMDS and as high dimensional 
nonlinear optimization problems in general. However, 
generally speaking, this dependence has only a very mi- 
nor effect. This has been checked for the fibroblast data 
(see below). 

RESULTS 

We have found that the fibroblast data may be embed- 
ded in a two dimensional space approximately as a ring 
(Fig. 2). The estimated number of correctly embedded 
genes is about 480 among all the 517 genes (i.e., the 
goodness of embedding is more than 90%). Also 516 
out of 517 genes have P < 0.005 confidence level (see 
Appendix I). One might wonder that this is too lax a 
criterion, because we should apply a multiple compari- 
son criterion when we have so many a number of genes. 
However, the P- value is still very small: we still have 
515 genes with the P < 0.005 confidence level of correct 



embedding under multiple comparison criterion. That 
is, we need not change the gross upper bound estimate 
of P. 

Fig. 2 

The obtained configuration is insensitive to the ini- 
tial configurations. To see this we constructed two 2D 
embedding results starting from two different random 
initial configurations. With the aid of the Procrustean 
similarity transformation (Borg and Groenen, 1997) one 
result is fit to the other (notice that our procedure is 
nonmetric, so to compare two independent results, ap- 
propriate scales, orientations, etc., must be optimally 
chosen). Fig. 3 demonstrates the close agreements of 
x- and y-coordinates of the two results. As illustrated, 
the dependence on the initial conditions is very weak, 
and we may regard the embedded structure as a faithful 
representation of the information in the original data. 
Thus, we conclude that the obtained configuration is 
sufficiently reliable. 

Fig. 3 

This ring-like arrangement of the genes faithfully r— - 

represents the temporal expression patterns of the genes I 
as can be seen clearly from the rotation of the expression 
peaks around the ring (Fig. 4a). It is noteworthy that 
the angle coordinate assigned to the genes according to 
the result shown in Fig. 2 automatically gives the figure 
usually obtained only through detailed Fourier analy- 
sis (Fig. 4b). These figures attest to the usefulness of 
nMDS, a nonlinear data mining method, for extracting 
nontrivial patterns in large scale data sets. 

One might criticize that the temporal pattern just 
mentioned is rather trivial and can be expected intu- 
itively, because the temporal pattern may be recog- 
nized, if we pay attention to the peaks only. However, 
the obtained ordering of the genes along the ring (i.e., 
the angle coordinate) cannot simply be obtained by or- 
dering genes according to their peak times alone. There 
are genes that exhibit peaks only at one time, so there 
is no way to order these genes sharing the peak time 
(just as there is no way to order genes in one cluster 
in cluster analyses without inspection). The detailed 
ordering obtained by nMDS reflects biologically mean- 
ingful events as discussed below. 

Most genes directly related to cell cycle have peaks 
at 24 hr. All of them are tabulated in Table 2; the infor- 
mation of the genes were obtained with the aid of Ace- 

View (http:/ /www. ncbi.nlm.nih.gov/IEB/Research/Acembly/). 
Thus, we seem to observe the transition from G2 to Gl 
through M phase along the ring between angle vari- 
ables 1.8 and 3 (needless to say, the angle coordinate 
difference of order 0.1 is not significant). That is, the 
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ordering found by nMDS is consistent with some known 
functions of genes. Since the starved fibroblasts are sup- 
posedly in Gl, perhaps one full cell cycle may have been 
captured in the original data. This may explain why the 
peak positions make one complete rotation. One might 
question that this full rotation is simply due to the use 
of all the data. Even if we use the data only up to 16hr 
from the start, we can actually almost reproduce the 
gene configuration exhibited in Fig. 2. Obviously, the 
peaks up to time 16 hr cannot complete any full circle 
around it. Thus, the observed full rotation in Fig 4a 
is not an artifact of our data analysis. Thus, we may 
conclude that nMDS captures, as cluster analyses do, 
major expression patterns of the genes but with a more 
natural time ordering in the current example. 



Table. 2 



Fig. 5 



Although there is a recent paper claiming that the 
data we are analyzing can provide information about 
the gene ontology (Laegreid et al. (2003)), we do not 
think the data can be used for this purpose (Appendix 
III). 

As has been clearly demonstrated, the 2D embed- 
ding is statistically natural and biologically informative. 
Still the 2D embedding is not perfect, so it is interesting 
to see what we might obtain by 'unfolding' the 2D data, 
adding one more axis. The unfolded result is shown in 
Fig. 5. Here, the angular coordinates <f> and 6 of the 
spherical coordinate system is determined by the xy- 
plane whose a;-(resp., y-)axis is the first (resp., the sec- 
ond) principal component of the 3D embedded result. 
The total contribution of these two components is 86%. 
We do not recognize any clear pattern other than that 
captured in the 2D space. Therefore, we may conclude 
that the 2D embedding result is sufficiently reliable and 
informative. 



However, one might wish to be more quantitative. 
Traditionally within the multidimensional scaling meth- 
ods, there is no very clear way to determine the dimen- 
sionality of the embedding space. Here, we propose a 
method utilizing the correlation among principal coor- 
dinates. Basically, the strategy is as follows. Suppose 
we embed identical data into n-dimensional and (n+1)- 
dimensional spaces. Using the embedded results, we can 
construct principal axes with the aid of PCA. Then, we 
study the correlation coefficients of the coordinates of 
the points. It is usually the case that the first n prin- 
cipal axes of the (n + l)-dimensional embedding result 
agree well (have high correlation coefficients) with the 
n principal axes of the n-dimensional embedding result. 
Now, we see the correlation for the (n + l)th axis of the 
(n + l)-dimensional embedding result and n principal 
axes obtained from the n-dimensional embedding. If the 



correlations are low enough, we may say n-dimensional 
space captures the main features in the data. We ap- 
ply this to n = 2. In the following table nDX denotes 
the Xth principal axis for the n-dimensional embedding 
(Table 3). As can be seen from the table, we may con- 
clude that 2D is enough to capture the main features in 
the data. We could even devise a statistical test based 
on the correlation coefficients, but we do not go into 
this further in this paper. 

In this paper we have demonstrated that nMDS has 
a capability of extracting a significant pattern without 
any supervision or preprocessing. However, the data 
we have analyzed is a very high quality well structured 
data. A natural question is what nMDS can offer for 
less well structured data. 

As an example, we here show the nMDS analysis 
result of Cho et al. (2001) on the cell cycle of human 
foreskin cells. Their data were analyzed in detail by 
Shcdden and Cooper (2002). According to their analy- 
sis periodicity of the genes in Cho et al. is statistically 
insignificant; the data randomly permuted along the 
time axis exhibit periodic patterns of similar or greater 
strength than the original experimental data. As was 
done by Shedden and Cooper, we have selected 300 peri- 
odic genes from both original data and randomly time- 
reordered data. One should keep in mind that it was 
impossible to distinguish these two data by inspection. 
The two upper figures in Fig. 6 is the results w hen we 
applied our nMDS to these two data sets. 



Fig. 6 



First, one can notice that both of them exhibit 
ring-like structures. It is natural because periodic genes 
are selected. However, in contrast to the analysis by 
Shcdden and Cooper, we can detect some differences 
between these two figures. Along the ring-like structure 
in the original data, the distribution of the genes is not 
uniform and has two dense regions placed diametrically 
opposite to each other. On the other hand, such a struc- 
ture cannot be seen in the randomized data. Needless 
to say, the forced periodicity by an artificial selection 
of the data is detected in both cases as a ring struc- 
ture, so one might conclude, as Shedden and Cooper 
did, the observed periodicity is not statistically signifi- 
cant. However, the difference between the original data 
and the randomized data clearly tells us the conclusion 
is premature. 

To highlight the capability of nMDS, the same se- 
lected data were analyzed by PCA with the preprocess- 
ing discussed in Appendix II (the data standardization 
by Method II) (the two middle figures in Fig. 6). We 
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see much more diffuse rings than obtained by nMDS. 
It is however interesting to note that PCA captures the 
nonuniformity of the distribution of genes noted in the 
above. 

Thus, we may conclude that nMDS has a capabil- 
ity of detecting subtle differences. In the example here, 
this could be detected by PCA as well (with normal- 
ization), but it is clear that nMDS is much more crisp 
than PCA. However, the data analyzed here are not re- 
ally the original data, but a subset of the original data 
selected by the good correlation to sinusoidal time de- 
pendence. Therefore, as duly criticized by Shedden and 
Cooper, the periodicity in the analyzed data can largely 
be an artifact. 

Therefore, to avoid this possible artifact we made 
a 300 gene subset consisting of genes with largest ex- 
pression level variations to avoid possible noise effects. 
The 2D embedding result of these genes with the aid of 
nMDS is shown in the bottom left of Fig. 6. Although 
diffuse, a ring-like structure is still visible. More inter- 
estingly, two distribution peaks diametrically placed on 
the ring are clearly visible. If we randomly permuted 
the same data along the time axis, the embedded result 
(bottom right) loses both the ring-like structure and the 
peaks. Therefore, we may conclude that the original 
data contains some information about the cell-cycle re- 
lated gene expression contrary to the conservative con- 
clusion by Shedden and Cooper. We have no intention 
to criticize their study that was admirably critical and 
careful; we simply wish to demonstrate that nMDS can 
detect subtle structures. 

Iyer et al. selected the genes to analyze according to 
the significance of the changes in their expression levels, 
so the temporal structure detected by our nMDS anal- 
ysis is not an artifact of the methodology in contrast to 
that exhibited in Cho et al. (2001). 

CONCLUSION 

We have demonstrated that the nMDS can be a useful 
tool for data mining. It is unsupervised, and perhaps 
maximally nonlinear. Our algorithm is probably the 
simplest among the nonmetric MDS algorithms and is 
efficient enough to enable the analysis of a few thousand 
objects with a small laptop machine. 

The nMDS algorithm works on the binary relations 
among the objects, so if there are N objects, computa- 
tional complexity is of order N 2 at least. Therefore, it 
is far slower than linear methods such as PCA, although 
our nonlinear algorithm is practically fast enough; we 



have used for this work a small laptop machine. (Mobile 
Celeron 650MHz cpu with 256MB RAM). Typically, the 
necessary number of iteration is about a hundred, and 
it takes only a few minutes for a few hundreds of genes. 
As has been pointed out and illustrated, with an appro- 
priate data preprocessing a certain linear method could 
give us a reasonable result with less computational ef- 
forts. Although in this paper we have not made any 
particular effort to reduce computational requirements, 
a practical way to use nMDS might be to prepare an 
initial configuration by a linear method with an appro- 
priate data preprocessing method that is verified to be 
consistent with the full nMDS results. 

APPENDIX I 

'Purely' non-metric MDS algorithm 

Suppose d(i,j) is the distance between objects i and j 
in 1Z. Let the ranking of S(i,j) among all the input 
dissimilarity data be n and that of d(i,j) among all the 
distances between embedded pairs be T n . If n > T n 
(resp., n < T n ), we wish to 'push' the pair i and j far- 
ther apart (resp., closer) in 1Z. Intuitively speaking, to 
this end we introduce an 'overdamped dynamics' of the 
points in 1Z driven by the following potential function 

A^]T(T„-n) 2 . 

Here, the summation is over all the pairs. In the ac- 
tual implementation of the algorithm, simpler forces 
are adopted than the one obtained from this potential 
as seen below, because the latter is complicated for nu- 
merical simulation to the extent of being not practical. 
There are many ways to modify A to make the force 
simpler; we have selected the one that makes the force 
formula very simple, if not the simplest. 

The A defined by the formula above may be re- 
garded as a counterpart of the stress in the conventional 
nMDS. As wc will see later we can use quantities related 
to A to evaluate the confidence level of the resultant 
configuration. However, we do not employ A itself, be- 
cause this quantity is not the difference between two 
ranks of independent objects; there are or dij 

but they are relations among only N potentially inde- 
pendent quantities. Therefore, the statistics of A is not 
known. 

Thus, although certain modifications as discussed 
above are needed, still in our nMDS algorithm, the op- 
timization process is closely connected to a process that 
improves the confidence level of the resultant configu- 
ration. 

The 'pure nMDS' algorithm for N objects may be 
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described as follows: 



1. Dissimilarities Sij — 1, • • • , N) for N objects 
are given. Order them as follows: 

■••<%< Ski < ■■■ ■ 

2. Put N points randomly in 1Z as an initial config- 
uration. 

3. Scale the position vectors in 1Z such as v/J^ l r il 2 = 
1, where r, is the current position of object i in 
K. 

4. Compute dij for all object pairs (i,j) in 7Z, and 
then order them as 

• • • < dij < dki < ■ ■ ■ ■ 

5. Suppose Sij is the mth largest in the ordering in 
^ and dij is the T TO th largest in the ordering in 
0] Assign dj — T m — m. Calculate the following 
displacement vector for i: 

where s = 0.1 x A~ 3 typically, and update r; — > 
T{ + Sri. 

6. Return to 3, and continue until the "potential en- 
ergy A" becomes sufficiently small. 



The reader may worry about the handling of tic 
data. Generally speaking, for a large data set the frac- 
tion of tie relations is not significant; furthermore, if the 
result depends on the handling schemes of tie data, the 
result is unreliable anyway. Therefore, we do not pay 
any particular attention to the tie data problem. 

In the above algorithm, s is a constant value. In 
practice, we could choose an appropriate schedule to 
vary s as is often done in optimization processes. In 
this paper, for simplicity, we do not attempt such a fine 
tuning. 

In the above algorithm, we can deal with asymmet- 
ric data as well, i.e., Sij =/= Sji if we compare Sij with dij 
while Sji with dji(= dij). Needless to say, if the mis- 
match between Sij and Sji is large, then representing 
the pair by a pair of points in a metric space is ques- 
tionable. Therefore, we will not discuss this problem 



any further in this paper. The microarray data ana- 
lyzed in this paper do not have such a problem in any 
case. 

Goodness of embedding 

In the text we have already discussed the effective num- 
ber of correctly embedded objects as a measure of 'global 
goodness of embedding.' This measure, however, can- 
not tell us the embedding quality of each object. It 
is often the case that the majority of objects are em- 
bedded well even without sensitive dependence on the 
initial conditions, but there are a few objects that con- 
sistently refuse to be embedded stably. To judge the 
quality of embedding for each object j we define 

A(i) = X)[ T »0-)0')-n(j')] 2 . 

Here, n(j) is the rank order of S(i,j) among N — 1 
pairs for a given j, and T n ij\(j) is the rank order 
of d(i,j) among N — 1 pairs for the same j. 

A(j) can be regarded as a statistical variable for 
the relative position of the j-th object with respect to 
the remaining objects (Lehmann 1975). We can esti- 
mate the probability P(e) of A(j) < e with the null hy- 
pothesis that the rank ordering of dij (i G {1, 2, • • • , N}\ 
{j}) is totally random with respect to the rank order- 
ing of S t3 (i e {1, 2, • • • , N} \ {j}). If N is sufficiently 
large, then A(j) obeys the normal distribution with 
mean (M 3 — M) /6 and variance M 2 (M + 1) 2 (M- 1)/36, 
where M = N — 1. For smaller N there is a table for 
P(e) (Lehmann 1975). Thus, we can test the null hy- 
pothesis with a given confidence level for j-th object. 

APPENDIX II 

Limitations and capabilities of linear methods 

The limitations and capabilities of PCA with and with- 
out data preprocessing are illustrated in this appendix. 
There is no fundamental difference between PCA and 
SVD. We consider the following artificial data {s g t}, 
where g (= 1, • • • , 517) denote genes and t (= 1, • • • , 11) 
the observation times: 
[Data set 1] 

sj f = C g cos(27rf/ll + 2irS g ). 

[Data set 2] 

4t = ex P(4t)- 

[Data set 3] 

s 3 gf = C g i cos(27rf/ll + 2irS lg ) 

+C g2 exp[cos(27rf/ll + 27r5 2 g)} 
+ C ff3 /cos(27rt/ll + 27r5 3ff ). 
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Fig. 7 



In the above, C g ,S g ,Ci g ,5i g ,(i — 1,2,3) are uniform 
random numbers in [0, 1]. That is, Data set 1 is a set of 
sinusoidal waves with random amplitudes and phases, 
Data set 2 is the nonlinear ly distorted Data set 1, and 
Data set 3 is a set of periodic functions that are very 
different from simple oscillatory behaviors. 

These data sets are analyzed by the following meth- 
ods. 

Method 1: PC A with the preprocessing used by Holter 
et al. (2000). The preprocessing procedure is as follows: 
step 1: Subtract the average, 

s gt — s gt ~ ( s gt)g,t, 

where (*) g ,t is the average over all genes and experi- 
ments, 

y *• 

\')g,t = 7- 

step 2: (Column normalization) Normalize the data as 



'gt 



'gt 



EgW 



step 3: (Row normalization) Normalize the data as 

s" 



s gt ~ 



'gt 



Repeat these steps until the following condition is sat- 
isfied, 

\/{[Sgt-S^ t ] 2 ) g , t <0.01. 

From the resultant s gt correlation matrix Matr.(Cor U ') 
is constructed, and then PCA is performed. 



Method 2: PCA with the preprocessing so that J2 t s gt — 
and J2 t s gt = 1 for all g. Of course, no iteration is 
needed for this preprocessing. From the resultant s gt 
correlation matrix Matr.{Cor t t') is constructed, and 
then PCA is performed. 

Method 3: nMDS as done in the text. That is, the 
negative of the correlation coefficient Cor gg > is used 
as the dissimilarity and nMDS is applied straightfor- 
wardly. Needless to say, no preprocessing of data is 
needed. 

The results are exhibited in Figure 7. The conclu- 
sions may be: 



(1) For Data set 1, any method will do. 

(2) For Data set 2, the procedure recommended by 
Holter et al. (2000) fails, although ironically simpler 
Method 2 still works very well. If the amplitude C 
is distributed in [0, 5] instead of [0, 1] (that is, the ex- 
tent of the nonlinear distortion is increased), Method 
2 becomes inferior to Method 3, but still Method 2 is 
adequate. 

(3) For Data set 3, even Method 2 fails. nMDS (Method 
3) still exhibits a ring-like structure. The method rec- 
ommended by Holter et al. (2000) is obviously out of 
question. 

Thus, we may conclude that nMDS is a versatile 
and all around data mining method for analyzing peri- 
odic temporal data. Furthermore, we can point out that 
the preprocessing method in Method 1 should not be 
used, because it could severely distort the original data 
(as may have been expected from the figures) . Suppose 
there are N genes and 4 time points. Consider the fol- 
lowing example (for the counterexample sake) . The first 
gene has (a, b, —b, —a) (a > b > 0), and the remaining 
genes are all give by (1,0,0,-1). The N x 4 matrix 
made from these vectors is polished by an iterative row 
and column vector normalization procedure. If N is 
sufficiently large, the first row converges to (0, 1, —1, 0) 
and the rest to (1,0, 0, —1), independent of a and b. If b 
is small, then all the vectors should behave almost the 
same way, but after polishing the out-of-phase compo- 
nent in the discrepancy between the first row and the 
rest is dramatically enhanced, resulting in a spurious 
out of phase temporal behavior. Although the preced- 
ing exercise is trivial, the result warns us the danger of 
using the so-called polishing. 

APPENDIX III 
Gene ontology 

Since we have seen Laegreid et al. (2003) claiming that 
gene ontology can be discovered in the same data we are 
analyzing, we tried to extract more biological informa- 
tion from the nMDS result. We have realized that the 
distribution of ontologically identified genes (or that of 
the distribution of 23 ontology tags adopted by Laegreid 
et al.) along the ring is statistically indistinguishable 
from the uniform distribution. In this sense, nMDS fails 
to capture ontological information of the genes. This 
might suggest that our nMDS method is definitely in- 
ferior to the analysis based on the rough set adopted 
by the above paper. However, we have found that the 
claimed success is largely illusory. The claimed success 
is: of the 24 genes for which homology information al- 
lows inference of their ontology, "11 genes had one or 
more classifications that matched this assumption (Ta- 
ble 10)." However, actually 8 ± 2.4 is within one a 
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error (note that out of 210 genes used to train the rules 
based on the rough set, each gene on the average share 
an ontology with about 70 other genes). That is, 11 
out of 24 is statistically hardly significant (the cross 
validation does not mean very much because it is only 
a consistency check of the method). Therefore, nMDS 
does not fail to capture the significant information that 
can be obtained by the existing statistical methods. 

Thus, we may conclude that the data we have an- 
alyzed may be sufficiently informative to follow the cell 
cycle related events, but not to annotate genes ontolog- 
ically. 
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Table legends 



Table 1 : The dependence of the effective number N' of correctly embedded points upon the number M of 
geometrically embeddable points. The N' = 100 case may reflect the fact that N' is a good estimate only for 
sufficiently many embeddable points. 

Table 2: Relationship between gene functions and the two-dimensional arrangement. 

Table 3: The correlation coefficients between principal axes of embedding in D-dimcnsional space. For detail, 
see text. 
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Figure legends 



Figure 1: PCA results using correlation coefficient matrix. The first two principal components are used as the 
horizontal and vertical axis, respectively (the cumulative proportion is 70 %). Genes whose experimental values 
are larger than 3.2 are drawn using filled boxes, otherwise, using small dots (the corresponding color figure is 
available online). From the top the time is, respectively, 15 min, 30 min, 1 hr, 2 hr, 4 hr, 6 hr, 8 hr, 12 hr, 16 
hr, 20 hr, 24 hr. The figures are arranged in two columns, but this is solely for the layout purpose; there is no 
distinction between two columns. 

Figure 2: Two dimensional embedding result obtained by nMDS. 

Figure 3: Comparison between the nMDS embedding results with two different initial configurations after Pro- 
crustean similarity transformation. The horizontal (resp., vertical) coordinates are compared in the left (resp., 
right) figure. In each figure x-axis corresponds to the result from one initial condition and the j/-axis the other. 

Figure 4: (a) Temporal patterns of gene expression levels visualized with the aid of nMDS. Genes whose exper- 
imental values are larger than 1.2 are drawn using filled boxes, otherwise, using small dots Time sequences are 
the same as explained at Fig. 1. (b) Gene expression data as a function of the angle measured from the vertical 
axis in (a). The horizontal axis corresponds to t. Genes whose experimental values are larger than 1.6 are drawn 
using filled boxes, otherwise, using small dots The figures are arranged in two columns, but this is solely for the 
layout purpose; there is no distinction between two columns. 

Figure 5: 3D unfolding of the temporal pattern of gene expression level with the aid of nMDS (3D). Experimental 
values are normalized as explained in Fig. 3. Genes whose experimental values are larger than 1.6 are drawn 
using filled boxes, otherwise drawn using small dots (the corresponding color figure is available online). The 
horizontal (resp., vertical) axis represents (j) (resp., 9). See the text for detail. 

Figure 6: Example of less well structured data. Upper row: nMDS; Middle row PCA with normalization (Method 
II in Appendix II); Lower row: Most expressive/depressive genes (with larger variance): Left column: original 
data; Right column: randomized data. For detail, see text. 

Figure 7: Comparison of linear and nonlinear methods. 

Method 1: PCA with polishing (Holter et al. 2000); Method 2: PCA with normalization; Method 3: 2D space 
embedding with the aid of nMDS. See the text for Data sets and Methods. For Methods 1 and 2, horizontal and 
vertical axes are the first and second principal components, respectively, and the percentages describe cumulative 
proportions. For Method 3, the percentages are the indicators of goodness defined in the text. 
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Table 1 Taguchi & Oono 



M N' N'-M y/N-M 



180 


183.6 


+3.6 


4.5 


170 


174.8 


+4.8 


5.5 


150 


158.4 


+8.4 


7.1 


120 


127.9 


+7.8 


8.9 


100 


89.6 


-10.4 


10.0 



Tabic 2 Taguchi & Oono 





Gene IN um- 








Angle 


ber in Iyer et 
al. (1999) 


Gene identification 


cell cycle 


notes 


1.83 


332 


RRM1 


S 


DNA synthesis 


1.84 


342 


Cyclin Bl 


G2/M 


predominantly during G2/M phase 


1.85 


331 


centromere protein F 


G2-M 


in late G2 through early anaphase; the prod- 
uct associates with the kinetochore. 


1.90 


339 


CDC2 


G2/M 


essential for Gl/S and G2/M transitions con- 
trolled by cyclin accumulation and destruction 
through the cell cycle. 


1.91 


328 


MAD2L1 


M 


a component of the mitotic spindle assem- 
bly checkpoint that prevents the onset of 
anaphase until all chromosomes are properly 
aligned at the metaphase plate. 


1.92 


333 


CCNA2 


G2/M 


transition Gl/S and G2/M. This cyclin binds 
and activates CDC2 or CDK2 kinases, and 
thus promotes both cell cycle Gl /S and G2 /M 
transitions. 


1.93 


329 


yanuru 


M 


encoding proliferation-associated nuclear pro- 
tein; associates with the spindle pole and mi- 
totic spindle during mitosis. 


1.95 


341 


CKS2 


G2/M 


encoding CDC28 protein kinase regulatory 
subunit 2. CKS2 protein binds to the cat- 
alytic subunit of the cyclin dependent kinases 
and is essential for their biological function. 


2.10 


308 


CENPE 


M 


encoding centromere protein E, 312kDa; first 
appears in prometaphase of M. 


2.51 


306 


LBR 


M 


encoding lamin b receptor; chromatin organi- 
zation. 


2.64 


204 


CDKN2C 


Gl 


encoding cyclin-dependent kinase inhibitor 2C 
(pl8, inhibits CDK4); controls Gl. 


2.68 


319 


CNAP1 


M 


encoding chromosome condensation-related 
condensin homolog. 



15 



Table 3 Taguchi & Oono 
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3DII 
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0.998 


0.014 


0.005 


2DII 


-0.014 
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-0.027 
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Figure 1 Taguchi & Oono 
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Figure 2. Taguchi & Oono 
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Figure 3 Taguchi & Oono 
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Figure 4 (Color) Taguchi & Oono 
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Figure 6 Taguchi & Oono 
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Figure 7 Taguchi & Oono 
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Figure legends for online only color figures: 



Figure 1 (Color; online only): PC A results using correlation coefficient matrix. The first two principal components 
are used as the horizontal and vertical axis, respectively (the cumulative proportion is 70 %). Colors indicate 
relative intensity of experimental values (red > 3.2, yellow > 2.4, green > f .6, pale blue > 0.8, gray < 0.8). From 
the top the time is, respectively, 15 min., 30 min., 1 hr, 2 hr, 4 hr, 6 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr. 

Figure 4 (Color; online only): (a) Temporal patterns of gene expression levels visualized with the aid of nMDS. 
Colors indicate relative intensity of experimental values normalized so that ^ t s gt = and J2t s 1t — 1 where 
s g t is experimental variable of 5th genes at time t. (red > 1.6, yellow > 1.2, green > 0.8, pale blue > 0.4, gray 
< 0.4). Time sequences are the same as explained at Fig. 1. (b) Gene expression data as a function of the angle 
measured from the vertical axis in (a). The horizontal axis corresponds to t. The color convention is the same as 
in (a). The figures are arranged in two columns, but this is solely for the layout purpose; there is no distinction 
between two columns. 

Figure 5 (Color: online only): 3D unfolding of the temporal pattern of gene expression level with the aid of 
nMDS (3D). The color convention is the same as explained in Figure 3. The horizontal (resp., vertical) axis 
represents <j> (resp., 8). See the text for detail. The figures are arranged in two columns, but this is solely for the 
layout purpose; there is no distinction between two columns. 
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Figure 4 (Color) Taguchi & Oono 
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